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We analyze the Agatston score of coronary artery calcium (CAC) 
from the Multi-Ethnic Study of Atherosclerosis (MESA) using the 
semiparametric zero-inflated modeling approach, where the observed 
CAC scores from this cohort consist of high frequency of zeroes and 
continuously distributed positive values. Both partially constrained 
and unconstrained models are considered to investigate the underly- 
ing biological processes of CAC development from zero to positive, 
and from small amount to large amount. Different from existing stud- 
ies, a model selection procedure based on likelihood cross-validation is 
adopted to identify the optimal model, which is justified by compar- 
ative Monte Carlo studies. A shrinkaged version of cubic regression 
spline is used for model estimation and variable selection simulta- 
neously. When applying the proposed methods to the MESA data 
analysis, we show that the two biological mechanisms influencing the 
initiation of CAC and the magnitude of CAC when it is positive are 
better characterized by an unconstrained zero-inflated normal model. 
Our results are significantly different from those in published stud- 
ies, and may provide further insights into the biological mechanisms 
underlying CAC development in humans. This highly flexible statis- 
tical framework can be applied to zero-inflated data analyses in other 
areas. 

1. Introduction. The Multi-Ethnic Study of Atherosclerosis (MESA) 
[Bild et al. (2002)] is an ongoing longitudinal study of subclinical cardiovas- 
cular disease (CVD) involving a cohort of more than 6500 men and women 
from six communities in the United States (http : / /www.mesa-nhlbi . org/). 
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It was initiated by the National Heart, Lung and Blood Institute in July 2000 
to investigate the prevalence, risk factors and progression of subclinical CVD 
in a population-based multi-ethnic cohort. Agatston score [Agatston et al. 
(1990)], which measures the amount of coronary artery calcium (CAC), is 
an important predictor of future coronary heart disease events [Min et al. 
(2010); Polonsky et al. (2010)]. However, many healthy people may have 
no detectable CAC; consequently, CAC equals zero with substantial relative 
frequency, but otherwise it is a continuous positive variable. That CAC has 
a mixture distribution with an atom at zero hampers its analysis by stan- 
dard statistical methods. Such data are referred to as "zero-inflated" and 
require the development of more complex statistical models. 

Zero-inflated data actually abound in many areas, for example, in health 
care cost studies [Blough, Madden and Hornbrook (1999)], environmental 
science [Agarwal, Gelfand and Citron- Pousty (2002)], ecological applications 
[Liu et al. (2011)], etc. Among various models for analyzing data with ex- 
cess zeroes, the hurdle model [Mullahy (1986)] has been proposed to han- 
dle both zero-inflation and zero-deflation in count data, which consists of 
two parts: one binary model to determine whether the response outcome is 
zero or positive and a second part conditional on the positive responses if 
the "hurdle is crossed." On the other hand, a zero-inflated model [Lambert 
(1992)] that assumes an underlying mixture distribution of probability mass 
at zero and some continuous or discrete distribution (e.g., normal, Poisson) 
has been widely used to analyze zero-inflated continuous data [Couturier 
and Victoria- Feser (2010)]. Note that both the hurdle model and the zero- 
inflated model are essentially equivalent to the two-part model [Kronmal 
(2005); Welsh and Zhou (2006)] when dealing with zero-inflated continuous 
data as the CAC score in MESA [see Min and Agresti (2005) for discus- 
sion on comparing existing models for zero-inflated count data]. Therefore, 
we will not distinguish the aforementioned two models and refer to the ap- 
proach as the zero-inflated model in the following discussion. Also note that 
the two-part model for zero-inflated continuous data, with the probit link for 
the binary model part, is a special case of the Heckman model [also known 
as Type II Tobit model, see Heckman (1979); Amemiya (1984)]. Most ex- 
isting zero-inflated models are in the parametric setting, assuming that the 
covariate effects are linear (on proper link scales). However, the assump- 
tion of linearity may not hold in public health or medical research. Instead, 
the semiparametric regression model [Ruppert, Wand and Carroll (2003)] 
provides a powerful tool to describing nonlinear relationships between the 
covariates and response variables in such situations. For instance, Lam, Xue 
and Cheung (2006) used the sieve estimator to analyze zero-inflated count 
data from a public health survey. 

In zero-inflated data analyses, it is often of interest to examine whether 
the zero and nonzero responses are generated by related mechanisms. In 
MESA, it may provide useful insights into the biological process on whether 
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or not the risk factors of CVD influence the probability of having posi- 
tive CAC and the progression of CAC when it is present in a similar way, 
which could be statistically verified by introducing proportional constraints 
into the zero-inflated model. Such a constrained zero-inflated model can be 
interpreted by a latent biological mechanism involving an unobservable ran- 
dom threshold and has been studied mostly in a parametric framework. For 
example, Han and Kronmal (2006) considered proportional constraints in 
two-part models in MESA to promote better understanding of the mech- 
anism that drives the zero-inflation in CAC, and to estimate the model 
parameters more accurately (intuitively because fewer parameters need to 
be estimated in a constrained model) as well. However, they did not take 
into account the nonlinear relationships between some covariates and the 
response variable in MESA [McClelland et al. (2006)]. Ma et al. (2010) 
incorporated proportional constraints in a semiparametric zero-inflated nor- 
mal model when analyzing the same data set, but they only considered 
a universal proportionality parameter on all covariates, which is not flexible 
enough to handle more complicated zero-inflation processes (see Section 2 for 
more discussion). Therefore, it becomes necessary to study a more flexible 
partially constrained semiparametric zero-inflated model to overcome the 
limitations of the existing investigations. We note that similar techniques 
of imposing proportional constraints on two sets of regression coefficients in 
complex models were investigated by Albert, Follmann and Barnhart (1997), 
Moulton, Curriero and Barroso (2002), among others. 

In this paper, we propose a partially constrained semiparametric zero- 
inflated model to analyze the CAC score in MESA, which provides a highly 
flexible approach for delineating the zero-inflated data generating process. 
Under the general partially constrained model framework, the unconstrained 
and constrained zero-inflated models together make it possible to shed new 
light on the relationship between the zero and nonzero data generating pro- 
cesses, and the latter promotes estimation efficiency when the postulated 
constraint holds. Cubic regression spline with shrinkage is adopted to esti- 
mate nonparametric regression functions and to select important variables 
simultaneously. Because of the complex model specification with a mixture 
distribution, a model selection procedure based on cross- validated likelihood 
is implemented to examine the prediction performance of the fitted models, 
and to choose the optimal zero-inflated model from multiple candidate mod- 
els with various partial proportional constraints, which avoids the problem 
of multiple testing by treating each candidate model on an equal basis. Es- 
timation of the proposed zero-inflated model and statistical inference will 
also be discussed. The outline of this paper is as follows. We introduce 
the semiparametric zero- inflated model methodology in Section 2. Simula- 
tion studies are carried out to illustrate the proposed model estimation and 
selection methods in Section 3. The analytical results of the MESA data 
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analysis are presented in Section 4. Some concluding remarks are discussed 
in Section 5. 

2. Methods. 

2.1. Semiparametric zero-inflated model. Statistical analysis of zero- 
inflated data cannot proceed under the assumption of regular probability 
distribution due to the high frequency of zeroes. If the nonzero responses 
are continuously distributed, a zero-inflated normal (ZIN) model can be uti- 
lized, which assumes a mixture distribution of probability mass at zero and 
a normal distribution, after suitable transformation. Suppose that given the 
covariate vectors Z = (Z±, . . . , Z m )' and X = (X\, . . . , Xf.)' , the conditional 
distribution of the response variable Y is zero-inflated normal: 

(2 1) y\z x~ / °' with P robabilit y (! ~y)> 

yj\f(fi,a 2 ), with probability p, 

where the covariate effects of Z are parametric and those of X are nonpara- 
metric. The above ZIN model consists of two parts: 

k 

(2.2) g(p)=p + (3'Z + ^h i (X i ) 

i=i 

links the nonzero-inflation probability p to the covariates via a link function g 
(e.g., logit or probit function) in the binary part, and the linear part 

k 

(2.3) /i = 7o + 7'Z + ^s l (X i ) 

8=1 

describes the covariate effects on the normal mean response [i. In the semi- 
parametric setting, /5o and 70 are two intercept terms, the regression co- 
efficients P = (/Si, . . . , P m )' and 7 = (71, . . . ,7m)' correspond to the para- 
metric effects in the two parts, respectively, and hi,Si,i = 1, . . . ,k, are two 
sets of nonparametric smooth functions. By setting some parametric coeffi- 
cients and/or some smooth functions to be identically zero, equations (2.2) 
and (2.3) subsume the case that the two parts of the model involve differ- 
ent sets of covariates. Each univariate smooth function hi(Xi) or Si(Xi),i = 
1, . . . , k, can be estimated nonparametrically using a cubic regression spline, 
which can be readily extended to a high-dimensional smoother using thin 
plate spline [Wood (2003)] to accommodate interaction between several con- 
tinuous predictor variables. 

Equations (2.1), (2.2) and (2.3) formulate an unconstrained semiparamet- 
ric ZIN model, which assumes that the covariate effects on the probability 
of having a nonzero response and the magnitude of the nonzero response 
may follow different data generating mechanisms. However, an interesting 
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research question arises as to whether the two processes are related to some 
extent such that some covariates influence the two processes similarly. The 
partially constrained zero-inflated modeling approach [Liu and Chan (2011)] 
could be used to test the above hypothesis, which assumes that some of the 
smooth components (operating on the same covariates) in (2.2) and (2.3) 
bear proportional relationships with the constraints: 

(2.4) hi = S iSi , ietf c{i,...,fc}, 

where ^ is the index set of the constrained smooth components; <5j, i E'if, are 
unknown proportionality parameters. The covariates corresponding to those 
smooth functions with proportional constraints then affect the nonzero- 
inflation probability and the mean nonzero response proportionally on the 
link scales. However, the other covariates with indices not in may have 
different impacts on the above two processes, which can be flexibly modeled 
by the unconstrained components. Note that the unconstrained zero-inflated 
model is a special case in the general partially constrained model framework 
with C € = Z. 

We consider proportional constraints in the zero-inflated model not only 
because they may result in more parsimonious models, but also because 
they may admit biological interpretation connected to some latent threshold 
model. To illustrate this connection, suppose that Y* is a latent response 
variable following the M(p,a 2 ) distribution. The observed response Y is 
zero if the latent mean response p is less than a random threshold T which 
could be due to measurement error or limits of detection, and it is equal 
to Y* if p exceeds the threshold. Hence, the nonzero-inflation probability 
p = Pr(y = Y*) = Pr(T < p) = Ft(p), where Ft is the cumulative distri- 
bution function (CDF) of the random threshold variable T. As a result, 
we would have g(p) = p if the link function is taken as the inverse CDF 
of T, which is, however, generally unknown. Nevertheless, according to Li 
and Duan (1989), under some mild regularity conditions, any maximum 
likelihood-type estimator is consistent up to a multiplicative scalar, even 
under a misspecified link function. More specifically, if we use, for exam- 
ple, a logit link in (2.2), the parameter estimators in the binary part are 
proportional to the true parameters in (2.3), that is, (3 = 5~f, and hi = 6si, 
i = 1, . . . ,k, for some scalar 5. Alternatively, assuming the zero-inflation is 
caused by some other biological characteristic depending on the covariates 
through £(Z,X), that is, Y = if £(Z,X) < T, we may have partial propor- 
tionality among the parameters in the two parts. Based on this latent biolog- 
ical process, we further relax the proportionality parameter to be possibly 
different across the linear and smooth components, leading to the proposed 
partially proportionally constrained zero-inflated model. 

As a closely relevant study in the literature, Ma et al. (2010) compared the 
unconstrained semiparametric zero-inflated model to a fully proportionally 
constrained model, which assumed (2.2) and p = a + t{(3'Z + J2i=i h>i(Xi)}, 
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with t being the universal proportionality scale parameter. The fully con- 
strained model is, however, quite inflexible and it cannot handle the cases 
with nonidentical sets of covariates in (2.2) and (2.3) or more complicated 
zero-inflation mechanisms [e.g., £(Z,X) 7^ /i as discussed above]. The par- 
tially constrained semiparametric zero-inflated model, on the other hand, is 
more flexible by untangling the constrained and unconstrained smooth com- 
ponents. In addition, when the postulated proportional constraint holds, the 
more parsimonious partially constrained zero-inflated model promotes esti- 
mation efficiency compared to its unconstrained counterpart. Compared to 
a more recent study by Liu et al. (2011) on similar problems in a constrained 
semiparametric two-part model, our method is computationally more afford- 
able and more flexible. In this study, we shall focus on the statistical inference 
and model selection regarding proportional constraints on the nonparamet- 
ric smooth components, which has not been discussed in the literature to 
our knowledge. Moreover, the estimation and inference methods proposed 
below could be readily lifted to the cases where the parametric terms are 
also (partially) constrained. 

2.2. Model estimation and inference. The proposed semiparametric zero- 
inflated model can be estimated by the penalized likelihood approach, which, 
in the unconstrained case, maximizes the following penalized log-likelihood 
function: 

k k 

F n £(Po,P, 70, 7, o~ 2 , h i, • • • , h k , si, . . . , s k ) - ^ J (hi) - ^ cp^J (s^, 

i=i i=i 

where P n is the empirical measure of n observations, £ = I(Y = 0) log(l — 

p) + I(Y ^ 0){logp— ^ Y 2 Jz } is t ne log-likelihood function for a single obser- 
vation, J(f) defines a roughness penalty functional of /, and X n ,i,fn,i,i = 
1, . . . , k, are the smoothing parameters corresponding to each penalty term, 
which control the trade-off between the smoothness of the function esti- 
mates and goodness of fit of the model. In this study, cubic regression spline 
is adopted with roughness penalty J(f) = J {f^ 2 \x)} 2 dx, where f( 2 \x) de- 
notes the second derivative of a univariate function f(x). The spline esti- 
mate can be represented as a linear combination of some basis functions: 
f(x) = 6» + 0\x + Y^j=i 8j+i( x ~ x *j)+i where x*,j = 1, . . . , K - 1, are fixed 
knots placed evenly (in terms of percentiles) over the corresponding ob- 
served covariate values [see Durrleman and Simon (1989) for more discus- 
sion on the knots selection in cubic splines], (x)+ = x if x > and (x) + = 
otherwise, = (9o, . . . , 6k)' is the parameter vector. Accordingly, the rough- 
ness penalty could be written as a quadratic form of the corresponding 
parameters, such that J(f) = 6'SO, where S is the penalty matrix. The 
smoothing parameters can be selected by generalized cross-validation (GCV) 
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or similar procedures. Under the main regularity conditions as following: 
(Rl) The covariates {Z,X} and the true parametric coefficients /3o, A 70, 
7 are bounded; (R2) hi,Si are nonconstant and satisfy J(/ij), J(sj) < oo; 
(R3) A nj i,9? ni j = Op(n~ 2 / 5 ); (R4) The Fisher information matrix is nonsin- 
gular, plus some minor technical conditions, the maximum penalized likeli- 
hood estimators of the smooth functions can be shown to be n 2 / 5 consistent, 
and the parametric coefficient estimators are n 1 / 2 consistent and asymptot- 
ically normal, using similar empirical processes techniques in Liu and Chan 
(2011). 

Statistical inference including construction of the confidence intervals 
for the parametric coefficients and confidence bands for the smooth func- 
tions can be based on the observed Fisher information matrix, which avoids 
computer-intensive bootstrap methods used in Ma et al. (2010) and Liu 
et al. (2011). Monte Carlo studies reported in Liu and Chan (2011) showed 
that such confidence intervals/bands enjoyed desirable empirical properties 
in that their across-the-function coverage rates were close to their nominal 
levels. Estimation and inference of a partially constrained semiparametric 
zero- inflated model follow a similar procedure. More details of the estimation 
algorithm and theoretical results can be found in Liu and Chan (2011). 

As pointed out by Wood (2006), a disadvantage of the cubic spline smoother 
is that the estimated smooth is never completely eliminated in the sense of 
having all corresponding parameters estimated to be zero. In addition, linear 
components in the smooth function are always unpenalized by the second 
derivative penalty. From a variable selection point of view [Huang, Horowitz 
and Wei (2010)], it may be desirable to have the smooth be shrunk com- 
pletely to zero if the corresponding smoothing parameter is sufficiently large, 
and preserve the curvature otherwise. Wood (2006) proposed to add an extra 
small amount of ridge-type of penalty to the original penalty matrix, that is, 
S £ = S + el was used as the penalty matrix with additional shrinkage. The 
parameters of a smooth function with large smoothing parameter are set 
to be exactly zero. But otherwise the additional small fraction of an iden- 
tity matrix has almost no influence on the cubic spline estimate if it is not 
shrunk to linearity by the roughness penalty. With this slight adjustment, 
the resulting cubic smooth with additional shrinkage behaves reasonably 
well in variable selection empirically, which is illustrated in a simulation 
study in Section 3. In the following discussion, the cubic regression spline 
with shrinkage is adopted to estimate the nonparametric covariate effects as 
well as to select relevant variables simultaneously. 

2.3. Partial- constraint selection. One remaining issue with the partially 
constrained zero-inflated model is to choose an optimal model in terms of 
prediction performance from multiple candidate models with various partial 
constraints [model (2.1) to (2.3) with constraint (2.4), note that different 
index sets ^ correspond to different partially constrained models, including 



8 



LIU, MA, KRONMAL AND CHAN 



^ = 0, that is, the unconstrained model] and justify the selection proce- 
dure. Liu and Chan (2011) proposed a model selection criterion for a non- 
parametric zero-inflated model based on the marginal likelihood, which is 
similar to the Bayesian information criterion (BIC) [Schwarz (1978)]. How- 
ever, although the marginal likelihood criterion was shown to work well for 
zero-inflated model selection both theoretically and empirically, it was de- 
rived for penalized regression splines without additional shrinkage. Little is 
known about its behavior when applied to the shrinkaged version of cubic 
spline, as we adopted in this study. Instead, cross-validation works almost 
universally [Shao (1993)] for most model selection purposes, which assesses 
the prediction performance of the models under comparison. The model se- 
lection method is easier to implement in practice than the hypothesis testing 
approach used in other studies [see, e.g., Han and Kronmal (2006)], which 
usually involves step-wise search and whose complexity increases dramati- 
cally with the number of candidate models. 

Among a variety of cross-validation methodologies [Arlot and Celisse 
(2010)], we use the Monte Carlo cross-validation (MCCV) [Picard and Cook 
(1984)] to examine the out-of-sample prediction performances of various 
partially constrained zero-inflated models under consideration. In particu- 
lar, the data are randomly partitioned into two disjoint sets, one of which 
with a fixed fraction 1 — v of the whole data (training set) are used to build 
the model, and the remaining v fraction of the data (validation set) are 
used to evaluate some goodness-of-fit criterion (or, equivalently, the risk) for 
each candidate model. The partition is repeated independently for B times 
and the out-of-sample prediction performance of each model is estimated 
by taking the average over the B validation sets. Furthermore, because of 
the complexity of the mixture zero-inflated distribution, the goodness-of-fit 
criterion need to be chosen with caution. We propose to use cross-validated 
likelihood as the prediction performance criterion, which is advocated in 
a probabilistic clustering problem using mixture modeling [Smyth (2000)]. 
The cross-validated (log-)likelihood of the fcth candidate model is defined as 

1 B ~ 

i=i 

where D denotes the original data, DJ is the validation set of the jth parti- 
tion, Qk(D\D^) is the maximum penalized likelihood estimator of the model 
parameter for the kth candidate model estimated from the jth training set, 
and I is the (log-)likelihood function evaluated on .DJ. It can be shown that 
the expected value of the likelihood evaluated on an independent validation 
data set is related to the Kullback-Leibler divergence between the truth and 
the model under consideration [Smyth (2000)]. 

Other possible model selection criteria include the mean squared error 
(MSE) of the nonzero responses, and the area under the receiver operating 
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characteristic (ROC) curve [AUC, larger is preferred as it indicates better 
prediction, see, e.g., Miller, Hui and Tierney (1991)] of the binary indicators 
of zero responses. However, both MSE and AUC have limitations when ap- 
plied to zero-inflated data. In particular, the MSE only measures the risk for 
the nonzero responses, whereas the AUC takes into account all validation 
samples, but it fails to assess the accuracy of the predictive value of the 
nonzero response. Sometimes the two criteria may point to different can- 
didate models, which confounds the model selection. In addition, the bias- 
corrected MSE [denoted as MSE C , it is not difficult to see that E(Y) =pfi 
from (2.1)] of both the zero and nonzero data can be calculated for each 
validation set as 



where yj is the ith. observed response in the validation set with n v samples, 
Pi and jli are the corresponding estimated nonzero-inflation probability and 
mean nonzero-inflated response from the fitted model, respectively. A sim- 
ulation study is carried out to evaluate the performance in model selection 
between the partially constrained and unconstrained zero-inflated models 
based on the cross- validated likelihood, AUC, MSE and MSE C in Section 3. 
In practice, it suffices to set the MCCV replication size B to some num- 
ber between 20 and 50 in most model selection problems in the parametric 
framework [Shao and Tu (1995)]. In the semiparametric setting, as in our 
study, we repeat the partition with equal size (y = 0.5) for B = 100 times in 
the simulation study and B = 200 in the MESA data analysis because of its 
much larger sample size. 

3. Simulation study. Before applying the zero-inflated modeling approach 
to the MESA data analysis, we first conduct a Monte Carlo study to exam- 
ine the performance of the proposed model selection method based on the 
cross-validated likelihood, as well as other goodness-of-fit criteria discussed 
in the previous section. The out-of-sample prediction performance is evalu- 
ated for two candidate models, that is, the partially constrained zero-inflated 
normal model (with the correct model specification) and its unconstrained 
counterpart. 

The simulated data were generated based on three univariate test func- 
tions si, S2 and S3 on [0, 1]: 



First, n independently uniformly distributed random variables X\, X2 and X3 
were generated on [0, 1] . A two-level factor covariate Z was set to be for the 




si(x) 

S2(x) 
S3(x) 



{0.2x n (10(l - x)f + 10(10x) 3 (l - x) 10 }/4 

2sin(7rx), 

exp(3s)/10. 
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first n/2 samples and 1 for the rest. The true nonzero-inflation probability p 
and nonzero-inflated mean response \x were generated by 

(3.1) logit(p) = 0.3Z + 0.5si(Xi) + s 2 (X 2 ), 

(3.2) » = -l + 2Z + s l (X 1 ) + s 3 (X 2 ), 

where each smooth component was centered at the observed covariate values 
and denoted as Sj,j = 1,2,3, respectively. The nonzero-inflated responses 
Y*, i = 1, . . . , n, were randomly sampled from normal M(fii,cr) distributions. 
The response variable was then "zero-inflated" according to the indicator 
random variables Ei, which followed independent Bernoulli (pi) distributions, 
that is, Yi = Y* if Ei = 1 and Y{ = if Ei = 0. The simulated data set is 
denoted as {(Yt, Zi, Xn, Xi 2 , Xis)}2 =1 . Note that the above data simulation 
procedure specifies a partially constrained ZIN with proportional constraint 
on the s\(Xi) component. But the covariate X 2 affects the probability of 
nonzero inflation and the mean nonzero response with different functional 
forms, namely, s 2 and S3 in equations (3.1) and (3.2), respectively. X% is 
a redundant covariate that has no impact in either p or fi. 

For each simulated data set, we fitted the partially constrained ZIN and 
the unconstrained counterpart, with nine evenly spaced knots for each cubic 
spline. We examined seven sample sizes from n = 200 to 800, with two noise 
levels a = 0.5 and 1. Figure 1 shows the smooth function estimates by the 
partially constrained ZIN fitted to one simulated data set with n = 400 
and a = 0.5. The estimated smooth functions by the unconstrained ZIN 
fitted to the same data set are displayed in Figure 2. The wide confidence 
band (i.e., large standard errors of the smooth function estimates) in the 
upper left panel of Figure 2 suggests the lack of efficiency in estimating the 
logistic part (3.1) by the unconstrained ZIN as compared to the constrained 



Linear (Constrained) Linear (Constrained) Logistic (Constrained) 




0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 

x1 x2 x2 



Fig. 1. Estimated smooth functions fitted by the partially constrained zero-inflated nor- 
mal model, with n — 400 and a = 0.5. The solid lines show the cubic regression spline 
estimates, with the dashed lines representing the 95% point-wise confidence bands. The 
gray dots denote the true functional values. 
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model, if the data were generated from the constrained model [see Liu and 
Chan (2011), Section 5, for more discussion on the estimation efficiency]. It 
is worthwhile to mention that in both the constrained and unconstrained 
models, the covariate effect of the redundant variable X3 was completely 
shrunk to zero by the cubic regression splines with shrinkage. 

MCCV were conducted for B = 100 times with u = 0.5 to evaluate the 
cross-validated likelihood, AUC, MSE and MSE C for the constrained and 
unconstrained ZIN models. In each of the aforementioned settings, 500 repli- 
cations were performed and the success rates in selecting the true model by 
each of the criteria were compared and summarized in Figure 3. As expected, 
the success rates of selecting the true model generally increase with the sam- 
ple size for each criterion. Except for very small sample sizes (n = 200,300, 
note that they were zero- inflated data with nearly 50% zeroes), the cross- 
validated likelihood outperforms all other three criteria. Especially for MSE 
of the nonzero data, the success rates are significantly lower than the other 
three in each scenario, which suggests that it is not a reliable measure of the 
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Fig. 3. Model selection performance of the cross-validated likelihood, AUG, MSE and 
MSE C . 

overall prediction performance for zero-inflated model. On the other hand, 
the bias-corrected version of MSE performs reasonably well. By comparing 
the levels of error variance, the success rates are observed to be consistently 
higher across different sample sizes for a = 0.5 (left panel of Figure 3) than 
<T = 1 (right panel), except for MSE. This seemingly implausible phenomena 
of MSE may be explained by the bias-variance trade-off of the imposed con- 
straint in the zero- inflated model. More specifically, the constrained model is 
more parsimonious and hence has smaller estimation variance as compared 
to the unconstrained model, which may also introduce bias. When the error 
variance is reduced, the bias becomes more dominant than the estimation 
variance in the MSE decomposition. Therefore, the unconstrained model 
tends to be more favored by MSE when the error variance is smaller. 

We also remark that the average discrepancies of all criteria between the 
unconstrained and correctly specified constrained models decrease as the 
sample size increases, which suggests that the relative predictive gain by 
the constrained ZIN diminishes with increasing sample size. This is not sur- 
prising because as the sample size increases, the estimation error becomes 
smaller relative to the intrinsic variabilities in the data. So if the data were 
truly generated from a partially constrained zero-inflated model and the 
sample size is large, we would benefit not as much on the estimation effi- 
ciency by fitting a constrained model as for small to moderately large sample 
sizes. In situations where there are too few observations to carry out an un- 
constrained semiparametric or nonparametric zero-inflated model analysis, 
fitting a partially constrained model may provide an elegant alternative — 
a perspective earlier advanced by Lambert (1992) within the parametric 
zero-inflated framework. 
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In summary, as illustrated by the Monte Carlo study, the cross-validated 
likelihood performs very well in selecting the true model with over 90% 
success rate under mid to large sample sizes (n > 400), which provides strong 
justification for the proposed model selection procedure. 

4. MESA data analysis. 

4.1. Model specification. The MESA data consist of 6672 participants 44 
to 84 years old (after removing missing values), among which 3343 (50.1%) 
have zero Agatston scores of CAC. We use log(CAC + 1) as the response 
variable (the log-plus-one transformation is commonly used in many ap- 
plications to avoid long tails and preserve the zeroes), and the covariates 
include gender (0- female, 1-male), race (0-Caucasian, 1-Chinese, 2- African 
American, 3-Hispanic), diabetes mellitus (0-normal, 1-otherwise) , cigarette 
smoking status (0-never, 1-former, 2-current), age, body mass index (BMI), 
diastolic blood pressure (DBP), systolic blood pressure (SBP), high-density 
lipoprotein (HDL) cholesterol and low-density lipoprotein (LDL) cholesterol, 
of which the first four could be treated as factor predictors, and the rest are 
continuous variables. Because approximately half of the CAC scores are ze- 
roes, while the remaining are positive and continuously distributed, we adopt 
a semiparametric zero- inflated normal regression model for the response vari- 
able Y = log(CAC + 1) (see Figure 4), with the conditional response distri- 
bution as specified by (2.1). The covariate effect of BMI was found to be 
linear in a preliminary analysis, hence, it was modeled as a parametric term. 
There was only slight interaction between HDL and LDL cholesterol levels, 
so they were modeled additively for ease of interpretation. As a consequence, 
the probability of having positive CAC is linked via the logit function (it is 




Fig. 4. Histogram of log (CAC + 1) from MESA. 
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referred to as logistic part and henceforth) to the covariates as follows: 
logit(p) = (3q + (3\Male + fa Chinese + f3 3 Black + ^Hispanic 

(4.1) + /3 5 Cig f + /3 6 Cig c + /3 7 DM + faBMI 

+ hiAge) + h 2 {DBP) + h 3 (SBP) + h A (HDL) + h 5 (LDL), 

where /3o to (3s are the regression coefficients associated with the parametric 
terms; DM stands for abnormal diabetes mellitus status; Cig j and Cig c are 
binary indicators of former and current smoker respectively; hi, i = 1, . . . , 5, 
are unknown smooth functions. The mean positive (transformed) CAC level 
is specified as follows (linear part): 

A* = 7o + "f±Male + 72 Chinese + 73 Black + "f^Hispanic 

(4.2) + 75 Cig f + j 6 Cig c + ^ 7 DM + 78 BMI 

+ sx(Age) + s 2 {DBP) + s 3 (SBP) + s 4 (HDL) + s 5 (LDL), 

where 70 to 78 are regression coefficients, Sj, i = 1, . . . , 5, are smooth functions 
possibly distinct from hi. All univariate smooth functions in the logistic and 
linear parts above were estimated nonparametrically using cubic regression 
splines with shrinkage and nine evenly spaced knots to identify important 
risk factors, as discussed in Section 2. 

The unconstrained ZIN models (4.1) and (4.2) assume that the covariate 
effects on the probability of having a positive CAC score and the mean pos- 
itive (transformed) CAC may be driven by different physiological processes. 
As discussed earlier, a partially constrained zero-inflated model could be 
used to test whether the two processes are related to some extent. For ex- 
ample, we can add a proportional constraint h\ = 5\S\ to examine whether 
age acts in a similar manner in affecting CAC from zero to positive, and from 
small amount to large amount. By comparing the fitted partially constrained 
and unconstrained models based on their cross-validated likelihoods, we can 
properly address interesting scientific hypotheses as above, which may help 
elucidate the biological process responsible for CAC development. 

Table 1 lists some partially constrained and unconstrained ZIN models fit- 
ted to the MESA data, with corresponding cross-validated (log-)likelihood, 
AUC, MSE and MSE C estimated from B = 200 replications (given that the 
sample size is considerably larger than that in the simulation study) of 
MCCV with equal size of training set and validation set {y = 0.5). The 
DBP effect was found to be completely eliminated in both the logistic and 
linear parts and, hence, it was treated as an unconstrained component. We 
did not include models with constraints on the HDL and/or LDL compo- 
nents due to convergence problem. This suggests that the HDL and LDL 
effects are likely to be very different in the two processes, namely, the ab- 
sence/presence of CAC and the level of CAC when it is present, such that 
forcing them to be proportional on the link scales will cause numerical prob- 
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Table 1 

Comparison of candidate zero-inflated normal models fitted to the MESA data based on 
Monte Carlo cross-validation with B = 200 and v — 0.5. V " denotes the proportional 
constrained smooth component; "x " denotes the unconstrained smooth component 



Model 


Age 


DBP 


SBP 


HDL 


LDL 


loglik 


AUC 


MSE 


MSE C 


Mi 


X 


X 


X 


X 


X 


-5018.5 


0.79 


2.59 


4.38 


M 2 


/ 


X 


X 


X 


X 


-5034.9 


0.78 


2.60 


4.42 


M 3 


X 


X 


✓ 


X 


X 


-5022.3 


0.79 


2.61 


4.38 


M 4 


/ 


X 


/ 


X 


X 


-5034.6 


0.78 


2.60 


4.42 



lems in the estimation. Therefore, we considered four candidate semipara- 
metric zero-inflated models in the MESA data analysis: M\ — no constraint 
imposed, Mi — proportional constraint on age, M3 — proportional constraint 
on SBP, and M4 — proportional constraints on both age and SBP (with dif- 
ferent proportionality parameters). 

According to the cross-validated likelihood, the unconstrained ZIN mod- 
el M\ has the best prediction performance among all candidate models. The 
second best model is the constrained model M3, which imposes a propor- 
tional constraint on the SBP component (estimated proportionality parame- 
ter is 0.682 with standard deviation 0.157). Note that all other three criteria, 
that is, AUC, MSE, and MSE C , are very close, especially for M\ and M3. This 
is expected because, as discussed in the end of Section 3, the discrepancies 
between these criteria would be very small with large sample size. In fact, 
the AUC and MSE C criteria (which are two reasonably reliable measures as 
demonstrated in the simulation study) of M\ and M3 are so close that it 
is hard to discern any differences. However, there is still some gain in the 
cross-validated likelihood by fitting an unconstrained ZIN as compared to 
the constrained models. We also tried other values of the fraction v between 
0.5 and 0.85 with more data in the validation set (y = 0.85 corresponds to 
1000 samples in the training set) to assess the robustness of the likelihood 
cross-validation procedure. The unconstrained model was consistently se- 
lected under various sizes of the validation set. Therefore, according to the 
prediction performance using cross-validation, the unconstrained model per- 
forms better than the partially constrained models, which suggests that the 
covariates act differently in predicting the presence of positive CAC and its 
severity when it is positive. The above result is significantly different from 
existing studies, including Han and Kronmal (2006), Ma et al. (2010) and Liu 
et al. (2011) in the determination of proportional constraint in zero-inflated 
models of CAC score in MESA. 

4.2. Analytical results. We now present the results of the fitted uncon- 
strained semiparametric zero-inflated model to the MESA data, as selected 
by the model selection procedure based on likelihood cross-validation. Ta- 
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Table 2 

Coefficient estimates of the fitted unconstrained zero-inflated normal model defined by 

equations (4-1) ond (4-8) 







Logistic 






Linear 




Estimate 


SE 


p- value 


Estimate 


SE 


p- value 


Intercept 


-1.13 


0.18 


<0.001 


3.46 


0.19 


<0.001 


Male 


0.91 


0.06 


<0.001 


0.67 


0.06 


<0.001 


Chinese 


-0.13 


0.10 


0.209 


-0.28 


0.10 


0.004 


African 


-0.79 


0.07 


<0.001 


-0.37 


0.07 


<0.001 


Hispanic 


-0.63 


0.08 


<0.001 


-0.34 


0.08 


<0.001 


DM 


0.25 


0.07 


<0.001 


0.27 


0.06 


<0.001 


Cig f 


0.37 


0.06 


<0.001 


0.19 


0.06 


0.002 


Cig c 


0.61 


0.09 


<0.001 


0.31 


0.09 


<0.001 


BMI 


0.03 


0.01 


<0.001 


0.02 


0.01 


0.002 



ble 2 lists the coefficient estimates of the parametric components. The model 
results suggest that men have increased risk of having positive CAC and 
higher mean CAC score when it is present, as compared to women. Both 
African and Hispanic Americans have reduced probability of having CAC, 
as compared with Caucasian. Chinese, African and Hispanic Americans all 
have lower average CAC level when it is positive. Having abnormal diabetes 
status will increase both the risk of positive CAC and its progression. Former 
smokers are more likely to have CAC and, on the average, they have higher 
positive CAC scores, as compared with nonsmokers. Current smokers have 
even higher risk and mean positive CAC level. BMI is linearly positively 
associated with both the probability of having positive CAC (on the link 
scale) and CAC score if it is positive. 

Among other related MESA studies, Han and Kronmal (2006) included 
only gender, race and age as the covariates, and their parameter estimates 
are similar to ours in signs and magnitudes, except that they found Chi- 
nese had significantly reduced risk of having positive CAC, as compared to 
Caucasian. The above findings on the parametric components are consistent 
with the unconstrained two-part model in Ma et al. (2010). 

The estimated nonparametric smooth functions are displayed in Figure 5. 
Table 3 summarizes the significance test results of the nonparametric terms 
based on F tests with the null hypothesis that the smooth function is iden- 
tically zero over the observed domain. Age is positively related to both the 
probability of positive CAC (p < 0.001) and positive CAC level (p < 0.001). 
However, its effect in the positive mean response fx shows some curvature 
at the right tail, suggesting that the age effect is not as strong in the old as 
in mid-aged people. Elevated systolic blood pressure is associated with in- 
creased risk of having positive CAC (p < 0.001) and higher CAC score when 
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Logistic Logistic Logistic Logistic 




Age SBP HDL LDL 

Fig. 5. Nonparametric smooth function estimates of the fitted unconstrained zero-inflated 
normal model defined by equations (4-1) an d (4-2)- The dashed lines constitute 95% point- 
wise confidence bands. The DBP effects are estimated to be zero in both the logistic and 
linear parts (not shown). 

it is present (p = 0.003). Its effect on the presence of CAC is nonlinear on the 
logistic scale, whereas it is almost linear in the positive mean response part. 
The probability of having positive CAC decreases as the HDL cholesterol 
level increases up to around 60 mg/dL, beyond which the risk then stays 
stable (p < 0.001). Among the participants who have positive CAC scores, 
those with HDL between 40 to 60 mg/dL were observed to have lower CAC 
levels, however, its influence is not statistically significant (p = 0.128). LDL 
is a pronounced risk factor of CAC initiation (p < 0.001). Nevertheless, the 
LDL effect on the extent and severity of CAC when it is positive is com- 
pletely eliminated by the shrinkaged cubic spline. The same was observed as 
to the diastolic blood pressure effects in both logistic and linear parts (not 
shown in Figure 5). 

This study may significantly differ from published MESA studies con- 
cerning the nonparametric covariate effects along the following perspectives. 
First, the unconstrained semiparametric zero-inflated model of the Agatston 
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Table 3 

Non-parametric smooth function estimates of the fitted unconstrained zero-inflated 
normal model defined by equations (4-1) and (4-2). EDF stands for effective 
degrees of freedom. The p-values are based on F tests for significance 







Logistic 






Linear 




EDF 


F statistic 


p- value 


EDF 


F statistic 


p- value 


s(Age) 


2.4 


816.6 


<0.001 


2.7 


116.7 


<0.001 


s(DBP) 


NA 


NA 


NA 


NA 


NA 


NA 


s(SBP) 


1.7 


31.1 


<0.001 


1.0 


7.2 


0.003 


s(HDL) 


3.0 


19.5 


<0.001 


2.8 


1.8 


0.128 


s(LDL) 


2.2 


38.0 


<0.001 


NA 


NA 


NA 



score was found to have the best prediction performance based on the like- 
lihood cross-validation procedure. Second, the age effect on the magnitude 
of CAC when it is positive, and the systolic blood pressure influence on the 
probability of having positive CAC, were both observed to be nonlinear. 
Third, LDL was shown to have no effect in predicting CAC level among 
those with positive CAC. And last, diastolic blood pressure was found not 
to be a risk factor in human CAC development by the cubic regression spline 
with shrinkage adopted in our study. 

5. Discussion and conclusion. We have presented a highly flexible semi- 
parametric regression model for analyzing zero-inflated data. Possible partial 
proportional constraints, whose biological interpretation could be traced to 
some latent threshold model under a possibly misspecified link function, 
were considered to promote estimation efficiency and help to reveal the con- 
nection between the zero and nonzero data generating processes. In order 
to choose the optimal model specification among multiple candidate mod- 
els with various partial constraints, a model selection procedure based on 
cross-validated likelihood was used, which was empirically corroborated by 
a simulation study. The proposed partially constrained zero-inflated model 
framework makes it possible to provide evidence-based justification to ad- 
dress research questions concerning the underlying mechanisms that drive 
the presence and magnitude of the nonzero response. In particular, it can 
be used to identify closely related covariate effects in the zero and nonzero 
data generating processes. We have adopted the cubic regression spline with 
shrinkage to estimate nonparametric smooth functions and select relevant 
variables simultaneously, which works well empirically in both simulation 
and real data application. However, its theoretical properties still need to 
be investigated in the future. 

When applied to the MESA data analysis, the semiparametric zero-inflated 
modeling approach indicates that the initiation of calcium in the human 
coronary artery and the magnitude of positive calcium (measured by Agat- 
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ston score) in the general population are better characterized by an un- 
constrained zero-inflated model. It is statistically justified that the initia- 
tors of coronary artery disease may be different from the factors that are 
related to extent and progression of the disease which is reflected by the 
amount of CAC in those with positive CAC scores. In particular, age and 
systolic blood pressure are both risk factors in influencing the development 
of CAC from zero to positive, and from small to large amount. But their 
effects show some extent of nonlinearity at certain stages. HDL and LDL 
cholesterol levels both have pronounced nonlinear effects in predicting the 
presence of CAC. However, only HDL has some impact (not statistically 
significant) on the extent of CAC in those who have positive CAC scores. 
These results may reflect the fact that the biological mechanisms underlying 
the initiation and progression of CAC are somehow different. The partially 
constrained semiparametric zero-inflated modeling approach (including the 
unconstrained case) with the model selection procedure based on likelihood 
cross-validation can be applied widely to complex data analysis with the 
zero-inflation problem. 
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